This is a fast-paced course that covers a lot of material. There will be a large number of references. You may need to do your own research to fill in the gaps between lectures and homework/projects. It is impossible to learn data science without getting your hands dirty. Please budget your time evenly. A last-minute work ethic will not work for this course.
Homework in this course differs from a typical homework assignment. Most assignments are built on real case studies. While you will be applying methods covered in lectures, you will also find that extra teaching materials appear here. The focus will always be on the goals of the study, the usefulness of the data gathered, and the limitations of any conclusions you may draw. Always try to challenge your data analysis in a critical way. Frequently, there are no unique solutions.
Case studies in each homework can be listed as your data science projects (e.g. on your CV) where you see fit.
R-studio and RMarkdown, dplyr, ggplot
Homework assignments can be done in groups of up to three members. Please find your group members as soon as possible and register your group on our Canvas site.
All work submitted should be completed in the R Markdown format. You can find a cheat sheet for R Markdown here. For those who have never used it before, we urge you to start this homework as soon as possible.
Submit the following files, one submission for each group: (1) Rmd file, (2) a compiled HTML or PDF version, and (3) all necessary data files if different from our source data. You may directly edit this .rmd file to add your answers. If you intend to work on the problems separately within your group, compile your answers into one Rmd file before submitting. We encourage you to at least attempt each problem by yourself before working with your teammates. Additionally, ensure that you can ‘knit’ or compile your Rmd file. It is also likely that you will need to configure RStudio to properly convert files to PDF. These instructions might be helpful.
In general, be as concise as possible while giving a complete
answer to each question. All necessary datasets are available
in this homework folder on Canvas. Make sure to document your code with
comments (written on separate lines in a code chunk using a hashtag
# before the comment) so the teaching fellows can follow
along. R Markdown is particularly useful because it follows a ‘stream of
consciousness’ approach: as you write code in a code chunk, make sure to
explain what you are doing outside of the chunk.
A few good or solicited submissions will be used as sample solutions. When those are released, make sure to compare your answers and understand the solutions.
(dplyr and ggplot)
How successful is the Wharton Talk Show Business Radio Powered by the Wharton School?
Background: Have you ever listened to SiriusXM? Did you know there is a talk show run by Wharton professors on Sirius Radio? Wharton launched a talk show called Business Radio Powered by the Wharton School through the Sirius Radio station in January of 2014. Within a short period of time the general reaction seemed to be overwhelmingly positive. To find out the audience size for the show, we designed a survey and collected a data set via MTURK in May of 2014. Our goal was to estimate the audience size. There were 51.6 million Sirius Radio listeners then. One approach is to estimate the proportion of Wharton listeners among Sirius listeners, \(p\), so that we arrive at an audience size estimate of approximately 51.6 million times \(p\).
To do so, we launched a survey via Amazon Mechanical Turk (MTurk) on May 24, 2014 at an offered price of $0.10 for each answered survey. We set it to be run for 6 days with a target maximum sample size of 2000 as our goal. Most of the observations came in within the first two days. The main questions of interest are “Have you ever listened to Sirius Radio” and “Have you ever listened to Sirius Business Radio by Wharton?”. A few demographic features used as control variables were also collected; these include Gender, Age and Household Income.
We requested that only people in the United States answer the questions. Each person could fill in the questionnaire only once to avoid duplicates. Aside from these restrictions, we opened the survey to everyone on MTurk in the hope that the sample would be more randomly chosen.
The raw data is stored as Survey_results_final.csv on
Canvas.
Select only the variables Age, Gender, Education Level, Household Income in 2013, Sirius Listener?, Wharton Listener? and Time used to finish the survey.
Change the variable names to be “age”, “gender”, “education”, “income”, “sirius”, “wharton”, “worktime”.
As with most real-world data based on user input, the data is incomplete: it has missing values and incorrect responses. There is no general rule for dealing with these problems beyond “use common sense.” In every case, explain what the problems were and how you addressed them. Be sure to explain your rationale for your chosen methods of handling issues with the data. Do not use Excel for this, however tempting it might be.
Tip: Reflect on the reasons for which data could be wrong or missing. How would you address each case? For this homework, if you are trying to predict missing values with regression, you are definitely overthinking. Keep it simple.
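The selection and renaming steps described above can be sketched with `dplyr::select()`, which renames columns while selecting them. This is only an illustration: the tibble `survey_raw` and its column names (`Age`, `WorkTimeInSeconds`, `HITId`, and so on) are hypothetical stand-ins, since the actual headers in Survey_results_final.csv are not shown here.

```r
library(dplyr)

# Hypothetical stand-in for Survey_results_final.csv; the real file's
# column names may differ from the ones assumed here.
survey_raw <- tibble(
  Age = c("25", "31"),
  Gender = c("Male", "Female"),
  Education = c("Bachelor's degree", "select one"),
  Income = c("$30,000 - $50,000", "Less than $15,000"),
  Sirius = c("Yes", "No"),
  Wharton = c("No", "No"),
  WorkTimeInSeconds = c(20, 35),
  HITId = c("a1", "b2")  # an extra column we do not need
)

# select() keeps only the listed columns and renames them: new_name = old_name
survey <- survey_raw %>%
  select(
    age = Age, gender = Gender, education = Education, income = Income,
    sirius = Sirius, wharton = Wharton, worktime = WorkTimeInSeconds
  )

names(survey)
```

With the real file, `survey_raw` would instead come from something like `read.csv()` on the CSV downloaded from Canvas.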
Initially, we thought the data contained only blanks and NAs, but we discovered fields left at their default value of “select one” in at least the education column, as well as incorrect values in the age column. Here we tabulate the missing and incorrect entries.
| | age | gender | education | income | sirius | wharton | worktime |
|---|---|---|---|---|---|---|---|
| Missing or NA Values | 1 | 6 | 0 | 6 | 5 | 4 | 0 |
| Incorrect Values | 5 | 0 | 19 | 0 | 0 | 0 | 0 |
Within the age variable, 1 person did not respond, 2 people entered ages that did not make sense in the context of the survey (4 and 223), and 3 people entered non-numeric values (“eighteen (18)”, “27`”, and “female”). We removed the missing response and the incorrect ages, except for “eighteen (18)” and “27`”, which we converted to their numeric values.
Within the education variable, 19 people left the default ‘select one’, and we removed these responses.
Within the gender, income, wharton, and sirius variables, we removed all missing values.
We did not remove any values from the worktime variable since it did not contain any missing or incorrect values.
Although we removed 44 values in total (22 missing and 22 incorrect), only 37 surveys were removed because some respondents had problems in more than one variable.
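The cleaning decisions described above can be sketched as follows. The toy tibble is hypothetical and only mimics the kinds of problems reported; the age bounds of 18 and 100 are an assumption made for this illustration, not a rule stated in the assignment.

```r
library(dplyr)

# Hypothetical toy data mimicking the problems described above
survey <- tibble(
  age = c("25", "eighteen (18)", "27`", "223", "4", "female", NA, "30", "41"),
  education = c("Bachelor's", "Some college", "High school", "Bachelor's",
                "Bachelor's", "Bachelor's", "Bachelor's", "select one", "Bachelor's"),
  gender = c("Male", "Female", "Male", "Male", "Female", "Female", "Male", "Male", "")
)

survey_clean <- survey %>%
  mutate(
    # recode the two recoverable free-text ages before converting to numeric
    age = case_when(
      age == "eighteen (18)" ~ "18",
      age == "27`" ~ "27",
      TRUE ~ age
    ),
    age = suppressWarnings(as.numeric(age)),  # non-numeric text such as "female" becomes NA
    # treat the untouched drop-down default and blank strings as missing
    education = na_if(education, "select one"),
    gender = na_if(gender, "")
  ) %>%
  # assumed plausibility bounds for an adult survey; adjust as needed
  filter(!is.na(age), age >= 18, age <= 100) %>%
  filter(!is.na(education), !is.na(gender))

survey_clean
```

Counting rows before and after each `filter()` is one way to check how many surveys, rather than values, each rule removes.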
Write a brief report to summarize all the variables collected. Include both summary statistics (including sample size) and graphical displays such as histograms or bar charts where appropriate. Comment on what you have found from this sample. (For example, it is very interesting to think about why one would work for a job that pays only 10 cents per survey. Who are those survey workers? The answer may be interesting even if it may not directly relate to our goal.)
This study received 1,764 surveys. We removed 37 of those after conducting quality control, leaving a total of 1,727 surveys. In general, the participants are mostly young, in their early to late twenties. Most are male. The majority have at least attended some college and make below 50,000 dollars. We describe the population of participants in greater detail and by several demographic categories below.
Age ranges from 18 to 76.
The distribution is right skewed, with most participants in their early to late twenties.
We can see the distribution of ages with a histogram and examine how it varies by other variables.
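A histogram of this kind, split by a second variable, might be built with ggplot2 as in the sketch below. The data are simulated here purely so the example runs on its own; the bin width and the choice of gender as the faceting variable are assumptions.

```r
library(ggplot2)

# Hypothetical right-skewed ages standing in for the cleaned survey data
set.seed(1)
survey <- data.frame(
  age = pmin(18 + round(rexp(500, rate = 1 / 12)), 76),
  gender = sample(c("Female", "Male"), 500, replace = TRUE)
)

summary(survey$age)  # range, quartiles, and mean of the simulated ages

age_hist <- ggplot(survey, aes(x = age)) +
  geom_histogram(binwidth = 2, colour = "white") +
  facet_wrap(~ gender) +  # one panel per level of the second variable
  labs(x = "Age (years)", y = "Number of participants")

age_hist
```

Swapping `gender` for another categorical column in `facet_wrap()` gives the same view by education or income.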
Most participants have completed some college education.
| Education | Counts | Percentage |
|---|---|---|
| Other | 2 | 0.116 |
| Less than 12 years; no high school diploma | 10 | 0.579 |
| High school graduate (or equivalent) | 190 | 11.002 |
| Some college, no diploma; or Associate’s degree | 737 | 42.675 |
| Bachelor’s degree or other 4-year degree | 611 | 35.379 |
| Graduate or professional degree | 177 | 10.249 |
Most participants make below $50,000
| Income | Counts | Percentage |
|---|---|---|
| Less than $15,000 | 206 | 11.93 |
| $15,000 - $30,000 | 360 | 20.85 |
| $30,000 - $50,000 | 420 | 24.32 |
| $50,000 - $75,000 | 371 | 21.48 |
| $75,000 - $150,000 | 326 | 18.88 |
| Above $150,000 | 44 | 2.55 |
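Count and percentage tables like the ones above can be produced with `dplyr::count()`. In this sketch the income column is rebuilt from the counts reported in the table, so only the category counts come from the document; the object names are made up for the example.

```r
library(dplyr)

# Rebuild an income column from the counts reported in the table above
income_levels <- c("Less than $15,000", "$15,000 - $30,000", "$30,000 - $50,000",
                   "$50,000 - $75,000", "$75,000 - $150,000", "Above $150,000")
survey <- tibble(
  income = factor(rep(income_levels, times = c(206, 360, 420, 371, 326, 44)),
                  levels = income_levels)
)

income_table <- survey %>%
  count(income, name = "Counts") %>%                            # one row per category
  mutate(Percentage = round(100 * Counts / sum(Counts), 2))     # share of all respondents

income_table
```

Storing `income` as a factor with explicit levels keeps the rows in income order rather than alphabetical order.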
### Business Radio Powered by the Wharton School Listeners
The population from which the sample is drawn determines where the results of our analysis can be applied or generalized. We include some basic demographic information for the purpose of identifying sample bias, if any exists. Combine our data and the general population distribution in age, gender and income to try to characterize our sample on hand.
Does this sample appear to be a random sample from the general population of the USA?
Does this sample appear to be a random sample from the MTURK population?
Note: You cannot provide evidence by simply looking at our data here. For example, you need to find the distribution of education in our age group in the US to see if the two groups match in distribution. You may need to gather some background information about the MTURK population to get a sense of whether this particular sample seems to be a random sample from it. Please do not spend too much time gathering evidence.
We use several datasets from the US Census Bureau:
- nc-est2019-sr11h: Annual Estimates of the Resident Population by Sex, Race, and Hispanic Origin for the United States: April 1, 2010 to July 1, 2019
- nc-est2019-agesex: Annual Estimates of the Resident Population for Selected Age Groups by Sex for the United States: April 1, 2010 to July 1, 2019
- data/table-1-01.xlsx: Table 1. Educational Attainment of the Population 18 Years and Over, by Age, Sex, Race, and Hispanic Origin: 2014
- data/hinc01R_1.xls: HINC-01. Selected Characteristics of Households, by Total Money Income in 2013 (income data comes from 2013)
Is the distribution of gender from the survey participants a random sample from the general population of the USA?
Additionally, we use data from this paper, which sampled a total of 2,026 U.S. adults on MTURK in mid-December 2019, to estimate the demographics of MTURK participants.
Is the distribution of ages from the survey participants a random sample from the general population of the USA?
### Education
Is the distribution of education from the survey participants a random sample from the general population of the USA?
Is the distribution of income from the survey participants a random sample from the general population of the USA?
Give a final estimate of the Wharton audience size in January 2014. Assume that the sample is a random sample of the MTURK population, and that the proportion of Wharton listeners vs. Sirius listeners in the general population is the same as that in the MTURK population. Write a brief executive summary to summarize your findings and how you came to that conclusion.
To be specific, you should include:
Wharton launched a talk show called Business Radio Powered by the Wharton School through the Sirius Radio station in January of 2014. To find out the audience size for the show, a survey was designed and a data set was collected via MTURK in May of 2014. The goal was to estimate the audience size.
To do so, a survey was launched via Amazon Mechanical Turk (MTurk) on May 24, 2014 at an offered price of $0.10 for each answered survey. It was set to run for 6 days with a target maximum sample size of 2000. Most of the observations came in within the first two days. The main questions of interest are “Have you ever listened to Sirius Radio” and “Have you ever listened to Sirius Business Radio by Wharton?”. A few demographic features used as control variables were also collected; these include Gender, Age and Household Income.
It was requested that only people in the United States answer the questions. Each person could fill in the questionnaire only once to avoid duplicates. Aside from these restrictions, the survey was open to everyone on MTurk in the hope that the sample would be more randomly chosen.
We assume that the sample is a random sample of the MTURK population, and that the proportion of Wharton listeners vs. Sirius listeners in the general population is the same as that in the MTURK population.
There were 51.6 million Sirius Radio listeners then. One approach is to estimate the proportion of Wharton listeners among Sirius listeners, \(p\), so that the audience size estimate is approximately 51.6 million times \(p\). Using this method, we estimated the Wharton audience size in January 2014 to be 2,585,789, with an interval of [1,982,330, 3,189,248].
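The scaling step can be sketched as a small function. The text does not say how the interval was obtained, so the sketch assumes a Wald (normal-approximation) interval for a proportion; the function name `estimate_audience` and the counts passed to it are hypothetical and are not the survey's actual values.

```r
# Scale a sample proportion up to the Sirius listener base.
# Assumption: a Wald (normal-approximation) interval for the proportion.
estimate_audience <- function(n_wharton, n_sirius, sirius_total = 51.6e6,
                              conf_level = 0.95) {
  p_hat <- n_wharton / n_sirius                  # share of Sirius listeners who heard the show
  se <- sqrt(p_hat * (1 - p_hat) / n_sirius)     # standard error of the proportion
  z <- qnorm(1 - (1 - conf_level) / 2)           # normal critical value
  c(estimate = p_hat, lower = p_hat - z * se, upper = p_hat + z * se) * sirius_total
}

# Hypothetical counts for illustration only
audience <- estimate_audience(n_wharton = 50, n_sirius = 1000)
audience
```

Because the interval is computed on \(p\) and then multiplied by 51.6 million, any uncertainty in the size of the Sirius listener base itself is not reflected in it.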
A major limitation of this study is that the population sampled from MTURK is not a random sample of the US population and differs from it substantially.
Now suppose you are asked to design a study to estimate the audience size of Wharton Business Radio Show as of today: You are given a budget of $1000. You need to present your findings in two months.
Please fill in the Google form to list the platform where your surveys will be launched and collected HERE
A good proposal will give an accurate estimation with the least amount of money used.
Are women underrepresented in science in general? How does gender
relate to the type of educational degree pursued? Does the number of
higher degrees increase over the years? In an attempt to answer these
questions, we assembled a data set (WomenData_06_16.xlsx)
from NSF
about various degrees granted in the U.S. from 2006 to 2016. It contains
the following variables: Field (Non-science-engineering
(Non-S&E) and sciences (Computer sciences,
Mathematics and statistics, etc.)), Degree
(BS, MS, PhD), Sex
(M, F), Number of degrees granted, and
Year.
Our goal is to answer the above questions only through EDA (Exploratory Data Analysis) without formal testing. We have provided sample R code in the appendix to help you if needed.
Notice the data came in as an Excel file. We need to use the package
readxl and the function read_excel() to read
the data WomenData_06_16.xlsx into R.
Read the data into R.
Clean the names of the variables. (Change variable names to
Field, Degree, Sex,
Year and Number.)
Set the variable types properly.
Any missing values?
We can count the number of NA values, if there are any.
In this dataset, there are no missing values.
We can count the number of unique entries in the Field variable.
There are 10 possible fields: Agricultural sciences; Biological sciences; Computer sciences; Earth, atmospheric, and ocean sciences; Engineering; Mathematics and statistics; Non-S&E; Physical sciences; Psychology; and Social sciences.
There are 3 degree types: BS, MS, and PhD.
There are statistics for 11 years, from 2006 to 2016.
The other variables in this study are Sex (Male or Female) and Number, the number of degrees awarded.
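The reading, renaming, typing, and missing-value checks described above might look like the sketch below. Since the Excel file is not available here, a small hypothetical tibble stands in for it; the raw column names `Field and sex` and `Degrees Awarded` and all the numbers are assumptions for illustration.

```r
library(dplyr)

# With the real file, the first step would be something like:
#   women_raw <- readxl::read_excel("WomenData_06_16.xlsx")
# The tibble below is a hypothetical stand-in; the raw column names are assumed.
women_raw <- tibble(
  `Field and sex` = c("Computer sciences", "Computer sciences", "Non-S&E", "Non-S&E"),
  Degree = c("BS", "BS", "PhD", "PhD"),
  Sex = c("Male", "Female", "Male", "Female"),
  Year = c(2006, 2006, 2016, 2016),
  `Degrees Awarded` = c(100, 25, 40, 60)
)

women <- women_raw %>%
  rename(Field = `Field and sex`, Number = `Degrees Awarded`) %>%
  mutate(
    Field = factor(Field),
    Degree = factor(Degree, levels = c("BS", "MS", "PhD")),  # keep degree order
    Sex = factor(Sex),
    Year = as.integer(Year)
  )

colSums(is.na(women))    # missing values per column
n_distinct(women$Field)  # number of distinct fields
```

Whether `Year` is better treated as an integer or a factor depends on the plot: an integer suits time trends, a factor suits side-by-side bars.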
Is there evidence that more males are in science-related fields vs
Non-S&E? Provide summary statistics and a plot which
shows the number of people by gender and by field. Write a brief summary
to describe your findings.
Is there evidence that more males are in science-related fields vs
Non-S&E? --> There are more people overall in
Non-S&E fields, and women outnumber men there. However, the plots show
that there are more men than women in the S&E fields.
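A summary of this kind could be computed by collapsing the fields into S&E versus Non-S&E and totalling degrees by sex, as sketched below. The numbers are hypothetical and are not taken from the NSF data; `SE` is a helper column introduced only for this example.

```r
library(dplyr)
library(ggplot2)

# Hypothetical degree counts; the real values come from WomenData_06_16.xlsx
women <- tibble(
  Field = rep(c("Computer sciences", "Psychology", "Non-S&E"), each = 2),
  Sex = rep(c("Female", "Male"), times = 3),
  Number = c(20, 80, 70, 30, 600, 400)
)

by_field_sex <- women %>%
  mutate(SE = if_else(Field == "Non-S&E", "Non-S&E", "S&E")) %>%  # collapse science fields
  group_by(SE, Sex) %>%
  summarise(Number = sum(Number), .groups = "drop")

# Side-by-side bars: one pair of bars per field type
field_plot <- ggplot(by_field_sex, aes(x = SE, y = Number, fill = Sex)) +
  geom_col(position = "dodge") +
  labs(x = "Field type", y = "Degrees granted")

by_field_sex
```

Using `position = "fill"` instead of `"dodge"` would show proportions rather than counts, which can make the gender split easier to compare when the two field types differ greatly in size.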
Describe the number of people by type of degree, field, and gender. Do you see any evidence of gender effects over different types of degrees? Again, provide graphs to summarize your findings.
In Non-S&E fields, the gender proportions are roughly the same across degree types, but there is more variability in the S&E degrees. Males and females obtain about the same number of science BS degrees (females slightly more), but males obtain more science MS and PhD degrees.
In this last portion of the EDA, we ask you to provide evidence numerically and graphically: Do the number of degrees change by gender, field, and time?
Do the number of degrees change by gender, field, and time? --> Females appear to earn more degrees overall at the BS and MS levels, but males earn more PhDs in STEM fields. Males also earn more STEM MS degrees than females.
Finally, is there evidence showing that women are underrepresented in data science? Data science is an interdisciplinary field of computer science, math, and statistics. You may include year and/or degree.
Overall, we believe there is enough graphical evidence that women are underrepresented in data science related fields. In computer science, that disparity is only getting worse with time. There are fewer people in math and statistics at the BS and MS levels; however, even with more overall degrees in math at the PhD level, women still earn about half as many math PhDs as men.
Summarize your findings focusing on answering the questions regarding if we see consistent patterns that more males pursue science-related fields. Any concerns with the data set? How could we improve on the study?
In general, there are more people overall in Non-S&E fields, and women outnumber men there. However, we consistently see that there are more men in the S&E fields. In Non-S&E fields, the gender proportions are roughly the same across degree types, but there is more variability in the S&E degrees. Males and females obtain about the same number of science BS degrees (females slightly more), but males obtain more science MS and PhD degrees. Females appear to earn more degrees overall at the BS and MS levels, but males earn more PhDs in STEM fields. Males also earn more STEM MS degrees than females.
We believe there is enough graphical evidence that women are underrepresented in data science related fields. In computer science, that disparity is only getting worse with time. There are fewer people in math and statistics at the BS and MS levels; however, even with more overall degrees in math at the PhD level, women still earn about half as many math PhDs as men.
We can improve on this study by also including people who pursued STEM degrees but did not complete them, or switched to another field.
We would like to explore how payroll affects performance among Major League Baseball teams. The data is prepared in two formats and records payroll and number/percentage of wins by team from 1998 to 2014.
Here are the datasets:
- MLPayData_Total.csv: wide format
- baseball.csv: long format
Feel free to use either dataset to address the problems.
Payroll may relate to performance among ML Baseball teams. One possible argument is that what affects this year’s performance is not this year’s payroll, but the amount that payroll increased from last year. Let us look into this through EDA.
Create increment in payroll
a). To describe the increment of payroll in each year there are several possible approaches. Take 2013 as an example:
- option 1: diff: payroll_2013 - payroll_2012
- option 2: log diff: log(payroll_2013) - log(payroll_2012)
Explain why the log difference is more appropriate in this setup.
b). Create a new variable
diff_log=log(payroll_2013) - log(payroll_2012). Hint: use
dplyr::lag() function.
c). Create a long data table including: team, year, diff_log, win_pct
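Steps b) and c) can be sketched on a long table with one row per team and year. The tibble `payroll_long` and its values are hypothetical, and the column names `payroll` and `win_pct` are assumed to match the long-format file.

```r
library(dplyr)

# Hypothetical long-format data: one row per team and year
payroll_long <- tibble(
  team = rep(c("A", "B"), each = 3),
  year = rep(2012:2014, times = 2),
  payroll = c(100, 110, 121, 50, 100, 100),
  win_pct = c(0.50, 0.52, 0.55, 0.40, 0.48, 0.47)
)

pay_diff <- payroll_long %>%
  group_by(team) %>%                     # lag within each team, not across teams
  arrange(year, .by_group = TRUE) %>%    # make sure years are in order before lagging
  mutate(diff_log = log(payroll) - log(dplyr::lag(payroll))) %>%
  ungroup() %>%
  select(team, year, diff_log, win_pct)

pay_diff
```

Team A grows by 10% each year and gets the same `diff_log` both times, even though the dollar increase differs (10 versus 11). That is the sense in which the log difference measures relative change, which keeps teams with very different payroll sizes comparable. The first year of each team has no previous value, so its `diff_log` is `NA`.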
a). Which five teams had the highest increase in their payroll between years 2010 and 2014, inclusive?
The Los Angeles Dodgers, Miami Marlins, Houston Astros, Kansas City Royals, and the Texas Rangers
| team | year | diff_log | win_pct |
|---|---|---|---|
| Los Angeles Dodgers | 2013 | 0.823 | 0.568 |
| Miami Marlins | 2012 | 0.729 | 0.426 |
| Houston Astros | 2014 | 0.703 | 0.432 |
| Kansas City Royals | 2012 | 0.522 | 0.444 |
| Texas Rangers | 2011 | 0.513 | 0.593 |
b). Between 2010 and 2014, inclusive, which team(s) “improved” the most? That is, had the biggest percentage gain in wins?
The Arizona Diamondbacks, Boston Red Sox, Houston Astros, Cleveland Indians, and Baltimore Orioles
| team | year | diff_log | win_pct | diff_log_win |
|---|---|---|---|---|
| Arizona Diamondbacks | 2011 | -0.124 | 0.580 | 0.369 |
| Boston Red Sox | 2013 | -0.139 | 0.599 | 0.341 |
| Houston Astros | 2014 | 0.703 | 0.432 | 0.317 |
| Cleveland Indians | 2013 | -0.008 | 0.568 | 0.302 |
| Baltimore Orioles | 2012 | -0.046 | 0.574 | 0.298 |
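Rankings like the two tables above can be obtained with `dplyr::slice_max()` after restricting to the years of interest. The sketch below assumes that `diff_log_win` is the log difference in winning percentage, by analogy with `diff_log`; the data are hypothetical and use `n = 2` instead of 5 to keep the example small.

```r
library(dplyr)

# Hypothetical long table with the columns used above
pay_diff <- tibble(
  team = rep(c("A", "B", "C"), each = 3),
  year = rep(2009:2011, times = 3),
  diff_log = c(0.9, 0.1, 0.3, 0.0, 0.5, -0.2, 0.2, 0.4, 0.05),
  win_pct = c(0.50, 0.55, 0.60, 0.40, 0.60, 0.45, 0.50, 0.50, 0.52)
)

ranked <- pay_diff %>%
  group_by(team) %>%
  arrange(year, .by_group = TRUE) %>%
  # assumed definition: relative change in winning percentage on the log scale
  mutate(diff_log_win = log(win_pct) - log(dplyr::lag(win_pct))) %>%
  ungroup() %>%
  filter(between(year, 2010, 2014))  # keep 2010 to 2014, inclusive

top_payroll <- slice_max(ranked, diff_log, n = 2)      # largest payroll increases
top_wins <- slice_max(ranked, diff_log_win, n = 2)     # largest gains in wins

top_payroll
top_wins
```

The lag is computed before filtering so that the 2010 rows can still use 2009 as their baseline.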
Log increases in payroll have a very weak linear relationship to performance overall.
Is there evidence to support the hypothesis that higher increases in payroll on the log scale lead to increased performance? Pick a few statistics, accompanied by some data visualization, to support your answer. --> All R^2 values are less than 0.5, indicating a weak linear relationship between teams' log increase in payroll and their performance.
Which set of factors better explains performance: yearly payroll or yearly increase in payroll? What criterion is being used?
This linear model shows that raw payroll predicts performance more significantly than log payroll, although the overall R^2 is still quite low.
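One way to make such a comparison explicit is to fit one model per predictor on the same rows and report a stated criterion for each. The sketch below uses adjusted R^2 and AIC as example criteria; these are assumptions, since the text does not say which criterion was used, and the data are simulated rather than taken from the MLB files.

```r
# Simulated data for illustration only; not the MLB data
set.seed(42)
n <- 150
baseball <- data.frame(
  payroll = runif(n, 40, 200),   # hypothetical payroll, in $ millions
  diff_log = rnorm(n, 0, 0.2)    # hypothetical log payroll increase
)
baseball$win_pct <- 0.45 + 0.0005 * baseball$payroll + rnorm(n, 0, 0.06)

fit_payroll <- lm(win_pct ~ payroll, data = baseball)
fit_diff <- lm(win_pct ~ diff_log, data = baseball)

# Same response and same rows in both models, so the criteria are comparable:
# adjusted R^2 (higher is better) and AIC (lower is better)
comparison <- data.frame(
  model = c("payroll", "diff_log"),
  adj_r2 = c(summary(fit_payroll)$adj.r.squared, summary(fit_diff)$adj.r.squared),
  aic = c(AIC(fit_payroll), AIC(fit_diff))
)
comparison
```

With the real data, the first season of each team has no `diff_log`, so those rows would need to be dropped from both fits before comparing them.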